Annual  Conference  of  the  Prognostics  and  Health  Management  Society  2014 


A  Probabilistic  Approach  for  Reliability  and  Life  Prediction  of 
Electronics  in  Drilling  and  Evaluation  Tools 


Amit A. Kale, Katrina Carter-Journet, Troy A. Falgout, Ludger Heuermann-Kuehn, Derick Zurcher


Baker Hughes Incorporated, Houston, Texas, 77379, USA

amit.kale@bakerhughes.com
katrina.carter-journet@bakerhughes.com
ludger.heuermann@bakerhughes.com
derick.zurcher@bakerhughes.com
troy.falgout@bakerhughes.com


Abstract 

The  capability  to  predict  performance  and  lifetime  of 
drilling  electronics  is  the  key  to  preventing  costly  downhole 
tool  failures  and  ensuring  success  of  any  drilling  operation. 
Drilling electronics operate under extremely harsh
downhole environments with temperatures beyond 150 °C
and vibration levels exceeding 15 g. In addition to
temperature  and  vibration,  there  are  several  factors  affecting 
electronic  reliability  that  have  high  uncertainty  and  cannot 
be  accurately  measured.  There  is  a  growing  trend  in  the  oil 
and  gas  industry  to  drill  faster  and  operate  at  higher 
temperatures  and  pressures,  forcing  tools  to  operate  beyond 
design  specifications.  This  has  resulted  in  increased  failure 
rate  leading  to  higher  maintenance  costs  and  system 
downtime  for  drilling  operators  as  well  as  service  providers. 
This  paper  develops  a  methodology  to  estimate  the  life  of 
drilling  electronics  by  using  operational  data,  drilling 
dynamics  and  historical  maintenance  information.  The 
methodology  combines  parameter  estimation  techniques, 
statistical  reliability  analysis  and  Bayesian  math  in  a 
probabilistic  framework.  Parameter  estimation  is  used  to 
calibrate  statistical  equations  to  field  data  and  probabilistic 
analysis  is  used  to  obtain  the  likelihood  of  failure.  In  the 
paper,  the  model  parameters  are  represented  as  random 
variables,  each  with  a  probability  distribution.  Drilling 
electronics  under  downhole  conditions  can  have  several 
failure  modes  and  each  failure  mode  can  be  caused  by  the 
interaction  of  several  variables.  When  information  on  each 
failure  mechanism  is  not  readily  available,  the  failure  is 
expressed  in  terms  of  several  candidate  models.  Bayesian 
updating  is  used  to  incorporate  real  time  operational  history 
for  a  specific  part  and  select  the  most  accurate  failure  model 
for that part. For the first time, a systematic approach
is developed for predicting the life of electronics in
downhole  drilling  environments  using  statistical  modeling 
and  probabilistic  methods  on  life  cycle  history  and 
operational  data  from  the  field. 

1.  Introduction 

Drilling  and  evaluation  operations  are  becoming  faster, 
more  accurate  and  safer,  thanks  to  modern  electronics  that 
enable  measurements,  storage  and  transmission  of 
information  in  real  time.  Transmitting  information  in  real 
time  makes  it  possible  to  evaluate  properties  of  earth’s 
formation  while  drilling  and  enable  directional  drillers  to 
steer  wells  towards  target  zones  more  efficiently.  The 
reliability  of  electronic  printed  circuit  board  assemblies 
(PCBAs)  in  the  bottomhole  assembly  (BHA)  is  the  key  to 
the  success  of  any  drilling  operation.  Drilling  electronics 
operate  in  extremely  harsh  downhole  environments  with 
temperatures exceeding 150 °C and shock and vibration levels
exceeding 15 g. The impact of temperature, shock and
vibration on the life of electronics is described by Barker et
al. (1992), Duffek (2004), Garvey et al. (2009), Gingerich et
al. (1999), Lall et al. (2005, 2007), Mirgkizoudi et al.
(2010),  Pecht  et  al.  (1999),  Vichare  (2006),  Vijayaragavan 
(2003),  Wassell  &  Stroehlein  (2010),  White  &  Bernstein 
(2008).  Other  factors  like  power  cycles,  thermal  ramp  rates, 
electrical  overstress,  mechanical  stress  and  manufacturing 
defects  impact  reliability  of  tools,  but  the  factors  cannot  be 
accurately  measured  in  downhole  drilling  environments  and 
encompass  high  uncertainty.  These  factors  can  act  alone  or 
interact  with  each  other  to  produce  several  degradation 
mechanisms  that  can  cause  failure.  For  example, 
Mirgkizoudi  et  al.  (2010)  demonstrated  through  tests  that 
there  is  significant  difference  between  the  lives  of  electronic 
components  subjected  to  thermal  testing  with  vibration  as 
compared  to  those  with  pure  thermal  loading.  Failure  of 
electronics  because  of  fatigue,  corrosion,  electromigration, 
filament formation and dielectric breakdown has been
established by the scientific community (e.g. Barker et al.
1992, Duffek 2004, Gingerich et al. 1999, Lall et al. 2005,
2007, and Pecht et al. 1999). Typical PCBAs used in the
drilling  industry  are  multiscale  devices  made  from  several 
components.  The  geometric  dimensions  of  individual 
components  may  vary  from  nanometers  to  inches.  This 
difference  creates  significant  challenges  in  developing  a 
predictive  model  for  failure  because  individual  components 
on  a  PCBA  may  fail  by  many  failure  modes  based  on  the 
operating  environmental  conditions.  Furthermore,  diagnosis 
of  faults  and  indicators  of  failure  is  difficult  because 
degradation  of  individual  components  may  not  lead  to  a 
measurable  loss  of  electrical  function  up  until  imminent 
failure.  There  is  growing  interest  in  the  area  of  health 
prognostics  for  electronic  components  through  the  use  of 
physics  based  models,  operating  data  from  fielded  products, 
design  qualification  testing  and  in-service  inspections  (e.g. 
Pecht et al., 1999, Vichare 2006, and Garvey et al., 2009).
The  main  drivers  behind  the  efforts  are  preventing  failure 
and  system  downtime,  reducing  costs  of  repair  and 
maintenance,  and  supporting  new  product  improvements.  A 
discussion  on  state  of  the  art  techniques  in  prognostics  and 
health  management  of  electronics  can  be  found  in  Pecht  et 
al.  (1999)  and  Vichare  (2006). 

The  method  of  measuring  failure  precursors  as  indicators  of 
impending  failure  is  based  on  the  hypothesis  that  degraded 
circuit  boards  produce  significantly  different  signatures 
from  defect  free  boards.  Failure  precursors  are  measurable 
indicators  that  can  be  correlated  with  subsequent  part 
failures.  Failure  indicators  for  electronics  like  shifts  and 
variation  in  temperature,  voltage,  current,  surface  insulation 
resistance  and  impedance  have  been  proposed  by  Born  & 
Boenning  (1989)  and  Pecht  et  al.  (1997,  1999).  Another 
area  of  research  in  electronics  prognostics  and  health 
management  (PHM)  is  usage  of  sacrificial  circuits  like 
fuses,  canaries,  circuit  breakers  and  self-diagnostics  sensors 
for  detecting  if  the  device  is  operating  outside  of  design 
limits.  These  devices  are  mounted  along  with  the  main 
electronic  component  but  have  accelerated  failure  rates  to 
provide  advance  warning  of  failure  (e.g.  Mishra  &  Pecht 
2002,  and  Ridgetop  Semiconductor  Sentinel  Silicon  report 
2004). 

The  physics  of  failure  (PoF)  based  approach  for  life 
prediction  uses  modeling  and  simulation  to  relate  the 
fundamental  physical  and  chemical  behavior  of  materials  to 
the  surrounding  environment  and  applied  loads.  The  PoF 
based  modeling  process  starts  by  exposing  the  product  to 
the  highly  accelerated  life  test  (HALT)  and  highly 
accelerated  stress  test  (HAST)  to  find  the  significant  modes 
and  root  cause  of  failure.  Next,  the  governing  equations  of 
the  failure  mechanisms  are  combined  with  the  data  gathered 
from  acceleration  tests  using  statistical  distributions.  The 
PoF  approach  has  been  successfully  applied  to  understand 
system  performance,  identify  weak  links  and  root  cause  of 
failure so that they can be mitigated before the product is
launched. Chatterjee et al. (2012) give a historical
perspective  of  the  evolution  of  the  physics  of  failure 
approach.  White  &  Bernstein  (2008)  present  the  state  of  the 
art  methods  for  PoF  modeling.  Finite  element  analysis  was 
used  to  model  fatigue  damage  growth  during  cyclic  loading 
(thermal,  mechanical  and  combination  of  both)  by  Barker  et 
al.  (1992),  Bailey  et  al.  (2007),  Dasgupta  (1993),  Duffek 
(2004),  Shinohara  &  Yu  (2010),  and  Vijayaragavan  (2003). 
Material  modeling  to  predict  degradation  of  solder  joints  in 
the circuit board as a result of thermomechanical fatigue was
developed by Nasser & Curtin (2006). Lall et al. (2007)
used  experimental  tests  in  combination  with  finite  element 
analysis  to  model  solder  joint  failure  from  shock  and 
vibration.  Mirgkizoudi  et  al.  (2010)  developed  a  test  plan  to 
evaluate  the  reliability  and  service  life  of  electronic 
components  that  are  subject  to  a  combination  of  mechanical, 
thermal, chemical or electrical inputs, and Wassell &
Stroehlein  (2010)  use  accelerated  tests  to  derive 
accumulated  damage  models  and  failure  thresholds  as 
functions  of  vibration,  shock  levels,  the  number  of  shocks 
and  the  operating  temperature.  Young  &  Christou  (1994) 
developed  models  for  failure  because  of  electromigration. 
The  models  obtained  from  accelerated  tests  are  also  widely 
used  to  estimate  the  life  for  fielded  products  by  using  the 
governing  equation  to  scale  accelerated  test  life  to  that  under 
the  actual  operating  environment  in  the  field.  However,  such 
scaling is valid only if the following conditions are met: (1)
failure  modes  and  mechanisms  for  accelerated  stress  levels 
are  the  same  as  those  observed  in  the  field  and  (2)  variations 
of  material  properties  with  stress  levels  are  incorporated  in 
the  governing  equations.  Because  of  these  limitations,  it  has 
been  shown  for  practical  application  that  life  obtained  by 
scaling  the  highly  accelerated  life  tests  (HALT)  and  highly 
accelerated  stress  tests  (HAST)  is  orders  of  magnitude 
different  from  those  observed  in  actual  field  environments 
(e.g. Osterman 2001, Pecht et al. 1997, 1999, and White &
Bernstein  2008). 

Field  data  driven  methodologies  for  modeling  time  to  failure 
have  gained  momentum  because  of  the  availability  of  large 
volumes  of  data  and  limitations  of  physics  based  methods  to 
simulate the actual operating environment in the laboratory (e.g.
Osterman 2001 and Vichare 2006). These methods use the
operating environment measured in the field, along with repair and
maintenance  information  of  fielded  products  in  conjunction 
with  statistical  modeling  to  predict  the  life  of  parts  in 
operation.  For  example,  Hu  et  al.  (1991)  presented  a 
probabilistic  approach  for  predicting  thermal  fatigue  life  of 
wire  bonding  in  microelectronics,  and  Vichare  et  al.  (2007) 
developed  an  algorithm  to  extract  load  parameters  necessary 
for  assessing  damage  from  commonly  observed  failure 
mechanisms  in  electronics.  Sutherland  et  al.  (2003) 
developed  data  mining  methods  and  statistical  approaches  to 
obtain  accurate  life  distribution  for  power  plant  maintenance 
optimization.

There is a growing trend in the oil and gas industry to drill
faster  and  operate  at  higher  temperatures  and  mechanical 
loads,  forcing  tools  to  operate  beyond  design  limits.  The 
capability  to  predict  performance  and  life  of  drilling 
electronics  is  critical  to  preventing  costly  downhole  tool 
failures  and  reducing  cost  of  maintenance.  This  paper 
presents a systematic approach for deriving and updating
models  for  time  to  failure  of  PCBAs  used  in  drilling  and 
evaluation  tools  using  field  data.  The  methodology 
combines  parameter  estimation  techniques,  statistical 
reliability  analysis  and  Bayesian  math  in  a  probabilistic 
framework.  Parameter  estimation  technique  is  used  to 
calibrate  statistical  equations  to  field  data  and  probabilistic 
analysis  is  used  to  obtain  the  likelihood  of  failure.  The 
model parameters are represented as random variables, each with a
probability distribution. Drilling electronics under
downhole  conditions  can  have  several  failure  modes  and 
each  failure  mode  can  be  caused  by  the  interaction  of 
several  variables.  When  information  on  each  failure 
mechanism  is  not  available  in  real  time,  the  failure  is 
expressed  in  terms  of  several  candidate  models.  Bayesian 
updating is used to incorporate the operational load history
for a specific part and to select the most accurate failure
model for the part. Results presented in the paper show that
the life of electronic assemblies used in drilling and
evaluation tools can be predicted accurately by using the
probabilistic model and incorporating operational effects.
Interaction  between  different  factors  causes  the  components 
to  degrade  faster  than  individual  factors  acting  alone. 

2.  Optimal  Maintenance  Planning 

The  framework  for  lifecycle  management,  optimal 
operations,  repair  and  maintenance  planning  of  drilling 
systems  requires  databases  to  record  equipment  lifecycle 
history,  environment  and  operations  data,  telemetry  and 
communication  systems,  sensor  and  measurement  systems 
and  algorithms  for  predicting  performance  and  consumed 
life.  Developing  an  optimal  maintenance  strategy  requires 
the  knowledge  of  component  life  as  a  function  of  usage. 
Predicting  component  life  accurately  requires  knowledge  of 
engineering  design,  physics  of  component  behavior  under 
operating  loads,  data  from  qualification  tests,  operating 
mission  of  fielded  products  and  indicators  of  degradation  of 
part  life  from  inspection  and  maintenance  shops.  The 
information  can  be  used  in  physics  based  or  statistical  data 
driven  models  (or  a  combination  of  both)  to  predict  part  life 
and  risk  of  failure  as  a  function  of  usage.  Once  accurate  life 
models  are  developed,  cost  factors,  performance  and 
reliability  targets  can  be  incorporated  to  optimize 
maintenance  plans  for  minimum  life  cycle  cost.  In  field 
operations,  life  extension  can  be  achieved  by  derating  the 
mission  (e.g.  lowering  rotational  speed  of  drill  to  reduce 
impact  of  vibration  induced  damage  on  BHA  components) 
so that parts degrade more slowly. Cost of repair and maintenance
can be lowered by using a risk based maintenance level. For
example, tools with low risk of failure can be given a quick
turnaround,  medium  risk  entails  partial  disassembly  and 
inspection,  and  high  risk  tools  require  full  piece  part  level 
disassembly  and  inspection.  The  goal  of  this  method  is  to 
enable  reliability  and  maintenance  personnel  to  schedule 
timely  maintenance  and  prevent  costly  downhole  tool 
failures.  Fig.  1  shows  a  high  level  overview  of  data, 
methods  and  decision  process  for  optimizing  operations  and 
maintenance  plans. 
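
The risk based maintenance levels described above can be sketched as a simple decision rule. This is a minimal illustration only: the function name and both probability thresholds are hypothetical placeholders and are not values taken from the paper.

```python
def maintenance_level(failure_risk, low_threshold=0.1, high_threshold=0.5):
    """Map a predicted probability of failure to a maintenance action.

    Both thresholds are hypothetical placeholders; in practice they would
    be set from cost factors and reliability targets.
    """
    if not 0.0 <= failure_risk <= 1.0:
        raise ValueError("failure_risk must be a probability in [0, 1]")
    if failure_risk < low_threshold:
        return "quick turnaround"
    if failure_risk < high_threshold:
        return "partial disassembly and inspection"
    return "full piece-part disassembly and inspection"
```

The three return values mirror the low, medium and high risk actions named in the text; only the cut-off values are assumed.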


Figure 1. Methodology for optimal operations and life
management  of  parts. 

This  paper  develops  a  framework  to  provide  advance 
warning  of  impending  failure  so  that  high  risk  components 
can  be  retired.  The  remainder  of  the  paper  focuses  on 
algorithms  to  estimate  part  life  using  data  from  field  and 
maintenance  shops.  Section  3  gives  an  overview  of  parts  in 
the  bottomhole  assembly  (BHA)  for  which  reliability 
models  are  developed.  Section  4  describes  the  algorithms 
used  to  analyze  field  data  and  develop  mathematical  models 
for  time  to  failure.  Section  5  describes  the  methodology  to 
use  load  history  from  each  drilling  mission  (also  known  as  a 
“run”)  to  update  model  weights  and  predict  part  life.  Section 
6 presents results for a fielded component and Section 7
concludes  the  paper  with  a  summary  and  future  work. 

3. Design of Bottomhole Assembly

A typical drilling system comprises a drill bit, a bottomhole
assembly (BHA), drill pipes and a rig (Fig. 2). The drill bit
is  a  rotary  cutting  tool  that  cuts  through  the  earth’s 
formation;  the  drilling  rig  is  a  structure  on  the  surface  that 
houses  equipment,  the  drill  pipes  provide  the  required 
extension  to  reach  a  target  depth  and  the  bottomhole 
assembly  (BHA)  is  a  structure  that  houses  drill  collars, 
reamers, steering system and electronic components. The
focus of the paper is predicting the life of electronic
components in the BHA of the AutoTrakGS product line
manufactured by Baker Hughes Incorporated. A typical
AutoTrakGS contains three modules: (1) the
AutoTrak steering system (ASS), which provides the necessary
drive to steer the bit; (2) the OnTrak sensor assembly, which
contains the electronics used for measurement while drilling
(MWD) and logging while drilling (LWD) and takes
measurements like resistivity, gamma ray, pressure and
vibration; and (3) the bi-directional communication and power
module (BCPM), which sends and receives data to
and  from  the  surface,  enabling  drillers  to  monitor  drilling 
operations  in  real  time  and  make  adjustments  when 
necessary.  The  BCPM  also  delivers  power  required  by  the 
other  modules  in  BHA.  The  three  assemblies  have 
components  that  are  critical  to  the  drilling  and  evaluation 
operation. Failure of the components can lead to the loss of
functionality and cause a trip for failure, which can cost
several million dollars. The paper focuses on developing
predictive  life  models  of  several  such  components  in  the 
drilling  system. 


4.  Field  Data  Analytics 

Developing  field  data  driven  models  for  life  of  electronic 
assemblies  in  drilling  operations  is  challenging  for  two 
reasons.  First,  not  all  of  the  factors  impacting  component 
life  can  be  measured  in  real  time,  and  second,  the  data  that 
can  be  measured  has  errors  and  noise  because  of  limitations 
of the measurement system and human factors. This paper
presents a method to calculate the reliability of components
that have been operated at varying stress levels because of
temperature and mechanical loads such as those caused by
shock and vibration. The Maintenance and Performance
System  (MaPS™)  is  a  state  of  the  art  database  developed  by 
Baker  Hughes  Incorporated  to  track  equipment  lifecycle 
data.  Information  related  to  operations,  failure,  repair  and 
maintenance  is  stored  for  serialized  parts.  The  downhole 
environment  data  like  temperature,  vibration,  pressure  and 
power  cycles  is  also  maintained  in  the  MaPS  database.  The 
magnitude  and  cyclic  variation  of  temperature  can  cause 
solder  joint  fatigue  failure  in  electronic  circuit  components, 
chip delamination, corrosion, electromigration, diffusion
voids and dielectric breakdown. Extreme vibrations
influence  the  life  of  electronic  components  in  the  BHA. 
There are three principal modes of vibration. (1) Axial
vibration along the tool axis can cause damage to seal faces
of modular connections and stabilizers and, in severe cases, can
lead to buckling fatigue. Axial vibration is responsible for
low rates of penetration and reduced efficiency. (2) Lateral
vibrations occur transversely to the tool axis. Historically,
they are the most destructive type of vibration, and constant
exposure to lateral vibrations can cause damage to tool
electronics. Constant lateral shocks damage the tool body as
well as greatly reduce drilling efficiency. (3) Stick slip is a
rotational phenomenon that occurs because of twisting of
the  drill  string.  Twisting  can  occur  when  the  bit  gets  stuck 
downhole  while  the  motor  continues  to  turn  the  drill  string. 
When  the  bit  is  free,  the  torsional  energy  stored  in  the  drill 
string  is  released,  causing  the  BHA  to  spin  in  the  opposite 
direction.  Stick  slip  can  lead  to  material  fatigue  and  physical 
damage  to  the  tool  and  electronics.  Figure  3  shows  the  three 
vibration  modes. 


Figure  3.  Vibration  modes  in  drill  string. 

4.1.  Consolidating  Life  Cycle  Data 

An  important  first  step  in  developing  a  life  model  is  to 
collect  life  cycle  history  for  each  part.  Each  serialized  part 
undergoes one of three maintenance actions during its
lifecycle: (1) repair, which involves replacing damaged
components on a PCBA, (2) revision upgrade, which may
include repairs and/or firmware updates, or (3) scrapping
because of failure or as a preventive measure. To accurately
capture  the  life  cycle  of  a  part,  the  accumulated  temperature 
and  vibration  hours  for  each  serialized  part  are  retrieved 
from  MaPS  database  and  grouped  using  the  steps  described 
in  Table  1.  The  purpose  of  the  steps  described  in  Table  1  is 
to  group  the  data  into  buckets  that  have  three  common 
characteristics,  namely  revision  id  flag,  repair  flag,  and 
revision  upgrade  flag.  Data  in  each  bucket  encompasses  the 
same  value  for  the  three  flags  and  any  two  buckets  have  at 
least one flag different between them. For example, the
bucket  in  which  the  three  flags  are  [“A”,  N,  N]  implies  that 
parts  in  that  bucket  are  revision  “A”,  they  have  never  been 
repaired  and  never  received  a  revision  upgrade.  Another 
bucket  with  flags  [“A”,  N,  Y]  implies  that  parts  in  that 
bucket  have  never  been  repaired  and  have  been  upgraded  to 
revision  “A”  from  an  older  revision.  A  bucket  with  flags 
[“A”, Y, Y] implies that all parts in that bucket have been
repaired and have been upgraded to revision “A” from an
older  revision. 

Table 1. Process to group part life cycle data for failures,
suspensions, repairs and revision upgrades.

(1) Find all the serial numbers of a given part number in
the database.

(2) Select a serial number and look up the mission profile for
that serial number, starting with the installation date.

(3) Accumulate drilling hours, circulating hours and the
operating environment variables (temperature,
vibration, rotational speed (rpm), distance drilled, etc.)
for each run; store the accumulated data in a record
with index i. Store the revision id flag, repair flag
(Y/N), revision upgrade flag (Y/N), and
failure/suspension flag (F/S).

(4) Check if the part underwent one of the following
actions after the run: (a) failed and scrapped, (b) failed
and repaired to put back in service, (c) upgraded to
next revision, (d) repaired to put back in service, (e)
scrapped because of preventive maintenance. If any of
the above is true, then label the record flag
appropriately, create a new record i+1 and go to step
3. If none of (a)-(e) happened, continue to
accumulate the fields for the record in step 3.

(5) Check if all the runs have been accounted for the serial
number. If no, go to step 3; otherwise, create a new
record for a new serial number.

It  is  important  to  make  the  distinction  between  revision 
upgrade  and  repair  because  not  all  revision  upgrades  lead  to 
life  extension  (for  example,  if  only  firmware  is  changed  in 
revision  upgrade).  Grouped  data  is  filtered  for  outliers  and 
weighted  before  building  a  life  model  using  an  algorithm 
described  in  the  next  section. 
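
The grouping procedure of Table 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the MaPS implementation: the run layout (a `loads` dictionary plus an optional `action` and `new_revision`), the action names and the serial number are all hypothetical, and a part still in service at the end of its recorded runs is treated as a suspension.

```python
from collections import defaultdict

# Hypothetical action labels for the outcomes (a)-(e) in step 4 of Table 1.
FAILURE_ACTIONS = {"failed_scrapped", "failed_repaired"}
REPAIR_ACTIONS = {"failed_repaired", "repaired"}
SCRAP_ACTIONS = {"failed_scrapped", "scrapped_preventive"}
CLOSING_ACTIONS = FAILURE_ACTIONS | REPAIR_ACTIONS | SCRAP_ACTIONS | {"upgraded"}


def build_records(serial, runs, revision):
    """Accumulate the runs of one serial number into life cycle records."""
    records = []
    repaired, upgraded = "N", "N"
    totals = defaultdict(float)

    def close(status):
        # Store accumulated loads with the flags that applied during the record.
        records.append({
            "serial": serial,
            "flags": (revision, repaired, upgraded),
            "status": status,  # "F" = failure, "S" = suspension
            **totals,
        })
        totals.clear()

    for run in runs:
        for field, value in run["loads"].items():
            totals[field] += value
        action = run.get("action")
        if action not in CLOSING_ACTIONS:
            continue  # no maintenance action: keep accumulating
        close("F" if action in FAILURE_ACTIONS else "S")
        if action in SCRAP_ACTIONS:
            break  # a scrapped part has no further records
        if action in REPAIR_ACTIONS:
            repaired = "Y"
        if action == "upgraded":
            upgraded, revision = "Y", run["new_revision"]
    if totals:
        close("S")  # part still in service: treated as a suspension
    return records


def bucketize(records):
    """Group records by (revision id, repair flag, revision upgrade flag)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["flags"]].append(record)
    return dict(buckets)


# Hypothetical history for one serial number.
runs = [
    {"loads": {"drilling_hours": 100.0, "temperature_hours": 80.0}},
    {"loads": {"drilling_hours": 50.0, "temperature_hours": 45.0},
     "action": "failed_repaired"},
    {"loads": {"drilling_hours": 80.0, "temperature_hours": 60.0},
     "action": "upgraded", "new_revision": "B"},
    {"loads": {"drilling_hours": 40.0, "temperature_hours": 30.0}},
]
records = build_records("SN-001", runs, revision="A")
buckets = bucketize(records)
```

Each record carries the flags that were in force while its hours accumulated, so the first record above lands in bucket [“A”, N, N] and the last in [“B”, Y, Y].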


4.2. Iteratively Reweighted Maximum Likelihood
Algorithm

The life cycle data for parts recorded in the maintenance
database is large and complex because each part has several
hundred serial numbers and each serial number has the
operating history for several drilling runs. As in any
physical experiment, the data can have errors or noise because
of human factors and flaws in the measurement system. The
impact of outliers on the quality of the predictive model can
be minimized by optimally weighting the life cycle data.
Outlier identification is done by first removing data points
that lead to constraint violation in the estimation process.
The likelihood equation is subjected to the constraint that $a_0 > 0$
and $a_1 \ldots a_n < 0$ in Eq. A-1, A-5 and A-8. The inclusion of
these constraints implies that life decreases with increase in
stress level due to temperature and vibration. Next, an
iteratively reweighted maximum likelihood estimation
(IRMLE) technique was developed to determine the optimal
weight of each data point in the life cycle data. Unlike the
conventional likelihood maximization procedure where all
points are weighted equally, the new technique iteratively
maximizes the weighted likelihood function of life data until
the quality of the model shows no further improvement.
The iteratively reweighted maximum likelihood estimation
procedure assigns a weight that is inversely proportional to
the log-likelihood of the data point, so that points with
lower log-likelihood are weighted less than points with
higher log-likelihood. Eventually, the model moves away
from outliers. The procedure can be summarized in steps
(1)-(4). The symbols used in these steps have the following
description.

T is temperature, L is lateral vibration, S is stick slip or
rotational vibration, RPM is revolutions per minute, $a_0$ is a
constant term, $a_1 \ldots a_n$ are coefficients on stress variables in
the life equation (e.g. Eq. A-1, A-5 and A-8), $w_{updated}$ is
the model weight, and $\mathcal{L}$ is the likelihood of a data point.

(1) Select X = [T, L, S, RPM, LT, ST, LS, SRPM] for
modeling the characteristic life function described in
Appendix A.

(2) Maximize the weighted sum of likelihood of failure and
suspension data to estimate the mean and variance of the
parameters of the characteristic life function (e.g.
$a_0, a_1 \ldots a_n$ in Eq. (A-1)). The initial weight of each data point is
unity. The maximization of the likelihood equation is
subjected to the constraint that $a_0 > 0$ and $a_1 \ldots a_n < 0$.

(3) Compute the value of the likelihood of each data point at the
values of the a's estimated in step 2. Compute the mean
and standard deviation of the likelihood, $\mathcal{L}_{mean}$ and $\mathcal{L}_{stdev}$.
The updated weight $w^{i}_{updated}$ of the $i$-th data point is
computed from the total number of data points $P$ and from the
likelihood of the point relative to $\mathcal{L}_{mean}$ and $\mathcal{L}_{stdev}$.


(4)  Iterate  step  (2)  -  (3)  with  updated  model  weights  until 
the  sum  of  likelihood  has  converged  within  a  specified 
tolerance  (10'^  used  in  this  paper). 
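
The loop in steps (1)-(4) can be illustrated with a deliberately simplified sketch. Several assumptions are made here and none of them should be read as the paper's method: the life distribution is a plain exponential with no stress covariates (so the weighted maximum has a closed form and no constrained optimizer is needed), the reweighting rule is an assumed stand-in that down-weights points whose log-likelihood falls below the mean and is not the paper's weight equation, and the tolerance value and all names are hypothetical. At least one failure is required in the data.

```python
import math
from statistics import mean, pstdev


def log_likelihood(t, failed, rate):
    """Exponential model: a failure contributes log f(t), a suspension log R(t)."""
    return (math.log(rate) if failed else 0.0) - rate * t


def weighted_rate(times, failed, weights):
    """Closed-form maximizer of the weighted exponential log-likelihood."""
    events = sum(w for w, f in zip(weights, failed) if f)
    exposure = sum(w * t for w, t in zip(weights, times))
    return events / exposure


def irmle_exponential(times, failed, tol=1e-6, max_iter=100):
    """Iteratively reweighted MLE for an exponential failure rate."""
    n = len(times)
    weights = [1.0] * n  # step 2: the initial weight of each point is unity
    previous = None
    for _ in range(max_iter):
        rate = weighted_rate(times, failed, weights)
        ll = [log_likelihood(t, f, rate) for t, f in zip(times, failed)]
        total = sum(w * l for w, l in zip(weights, ll))
        if previous is not None and abs(total - previous) < tol:
            break  # step 4: weighted likelihood has converged
        previous = total
        mu, sigma = mean(ll), pstdev(ll)
        if sigma == 0.0:
            break  # all points equally likely: nothing to reweight
        # Step 3 (assumed rule): shrink the weight of points whose
        # log-likelihood lies below the mean, in units of standard deviation.
        raw = [1.0 / (1.0 + max(0.0, (mu - l) / sigma)) for l in ll]
        scale = n / sum(raw)  # keep the weights summing to the sample size
        weights = [scale * r for r in raw]
    return rate, weights


# Hypothetical failure times in hours; the last point is an outlier.
times = [100.0, 120.0, 90.0, 110.0, 105.0, 2000.0]
failed = [True] * len(times)
rate, weights = irmle_exponential(times, failed)
```

In this toy data set the long-lived outlier has the lowest log-likelihood, so it receives the smallest weight and the fitted rate moves toward the value implied by the remaining points.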

In principle the IRMLE technique is similar to the
iteratively reweighted least squares (IRLS) except that in
IRMLE, the weighted sum of likelihood is maximized,
whereas in IRLS the weighted sum of squares of difference
between data and model response is minimized. The IRMLE
algorithm is used to build a transfer function for time to
failure  as  a  function  of  the  operating  mission  for  a  serialized 
part.  One  of  the  challenges  in  using  this  model  to  accurately 
estimate  remaining  life  is  that  the  operating  environment  is 
variable  throughout  the  life  of  a  component.  This  is 
overcome  by  updating  the  remaining  life  estimate  after  each 
drilling  mission  (life  of  a  part  can  span  several  drilling 
missions  and  each  mission  may  have  different  load  history 
and  hours).  The  application  of  this  algorithm  in  identifying 
outliers  is  presented  in  Fig.  A1  through  Fig.  A6  in  Appendix 
A.

5. Reliability Analysis

Statistical  models  are  extensively  used  in  reliability  and  life 
data  analysis  to  estimate  time  to  failure  of  parts  in  operation. 
The  models  are  either  computational  simulations  or  a  set  of 
mathematical  equations  that  explain  the  general  state  of  a 
system  under  the  influence  of  load  and  time.  Typically,  a 
mathematical  model  is  an  approximation  of  the  physical 
phenomena  and  rarely  matches  the  field  observations. 
However,  for  practical  commercial  application  where  the 
models  are  used  in  design  and  operation  of  a  product,  it  is 
desirable  to  have  a  model  that  matches  the  field  or 
experimental  data  closely.  The  process  of  determining  the 
unknown  model  parameters  by  tuning  the  model  to  field 
data is called parameter estimation or model calibration. The
model parameters usually represent quantities that have
physical significance and are determined by imposing some
constraints  during  the  calibration  process.  The  constraints 
require  that  the  parameters  being  estimated  must  have 
minimum  variance  from  using  one  set  of  data  to  the  next 
and  the  estimated  value  is  bound  to  the  true  value.  A 
reliability  model  that  best  represents  the  life  cycle  of  a 
component can be developed when a sufficient amount of
operation,  failure,  and  repair  and  maintenance  data  is 
available.  This  section  outlines  the  method  for  calibrating  a 
mathematical  model  to  field  data  and  its  subsequent 
application  to  predict  remaining  life  and  reliability  using 
real  time  mission  profile  for  a  specific  part. 

5.1.  Generating  Best  Fit  Model 

A  typical  time  to  failure  model  comprises  a  life  distribution 
function  to  incorporate  the  statistical  scatter  in  failure  time 
and a characteristic life function (Appendix A) that
describes a general relation between failure time and stress
levels.  In  this  work,  the  Weibull,  lognormal  and  exponential 
distributions  are  used  to  build  time  to  failure  models.  The 
life  characteristic  can  be  any  life  measure  such  as  the  mean, 
median  or  hazard  rate  that  represents  a  bulk  property  of  the 
distribution.  The  life  characteristic  is  expressed  as  a  function 
of  stress  (as  shown  in  Appendix  A).  The  unknown 
parameter  of  the  composite  model  is  determined  by  tuning 
the  model  equation  to  field  data  using  the  Iterative 
Maximum  Likelihood  Estimation  technique.  The  method 
for  deriving  the  model  that  best  fits  the  field  data  is 
described  in  the  following  steps: 

(1) Retrieve life cycle data from the maintenance database
and bucketize it using the method described in Section 4.1.

(2) Select a revision identifier, a trial function for stress $\eta_i$
and a trial function for probability distribution $f_j$ from
Appendix A. Initialize the trial functions with i = 1, j = 1.

(3) Calibrate the reliability model $f(t,x)_{ij}$ to the bucketed
field data using the IRMLE technique. Compute the standard
deviation of the parameter estimates.

(4) Compute the goodness of fit for model $f(t,x)_{ij}$ by
evaluating the prediction error sum of squares (PRESS^).

(5) Select a new probability distribution and trial function
by updating the values of i and j and repeat steps (2)-(4)
until all trial functions are evaluated.

(6) Generate the Pareto set of the solutions obtained from steps (1)-(5)
with two objectives, namely goodness of fit and the
Euclidean norm^ of the coefficient of variation of the
parameter estimates.
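Step (4) scores each candidate by PRESS, a leave-one-out error. As a minimal sketch of that idea only, the example below uses a straight-line least-squares fit as a stand-in for the calibrated reliability model; the function names and the toy data are hypothetical and are not taken from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x (stand-in for model calibration)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def press(xs, ys):
    """Prediction error sum of squares: refit without point k, then predict point k."""
    total = 0.0
    for k in range(len(xs)):
        train_x = xs[:k] + xs[k + 1:]
        train_y = ys[:k] + ys[k + 1:]
        a, b = fit_line(train_x, train_y)
        total += (ys[k] - (a + b * xs[k])) ** 2
    return total


# Hypothetical data: normalized stress level vs. log of hours to failure.
stress = [0.1, 0.2, 0.3, 0.4, 0.5]
log_life = [7.4, 7.1, 6.9, 6.4, 6.2]
print(press(stress, log_life))
```

A lower PRESS indicates that the candidate predicts held-out points better, which is the sense in which it is used as a goodness-of-fit objective here.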

The models generated by steps (1)-(4) yield a Pareto set of
competing solutions: some solutions are better in terms of
cross validation error, while others are better in terms of
confidence in the values of the estimated model parameters (the $\alpha$'s
described in Appendix A). The time to failure for a part in
operation is determined using the method described in the
next section.
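The two-objective selection can be sketched as a non-dominated filter. In the example below, lower is better for both objectives; the candidate names and scores are hypothetical, and the coefficient of variation is assumed to be the standard deviation divided by the absolute mean of each parameter estimate.

```python
import math


def cov_norm(means, stdevs):
    """Euclidean norm of the coefficients of variation (stdev / |mean|)."""
    return math.sqrt(sum((s / abs(m)) ** 2 for m, s in zip(means, stdevs)))


def pareto_front(candidates):
    """Return names of candidates that no other candidate dominates.

    Each candidate is (name, press, cov_norm); lower is better for both.
    """
    front = []
    for name, p, c in candidates:
        dominated = any(
            (p2 <= p and c2 <= c) and (p2 < p or c2 < c)
            for _, p2, c2 in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (model, PRESS, CoV norm) triples for illustration only.
candidates = [
    ("weibull_gll", 1.2, 0.10),
    ("lognormal_gll", 1.0, 0.15),
    ("exponential_gll", 1.5, 0.20),
]
print(pareto_front(candidates))  # ['weibull_gll', 'lognormal_gll']
```

The third candidate is dropped because the first is better on both objectives; the remaining two trade fit quality against parameter confidence.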

5.2.  Model  Selection  and  Updating  Using  Real  Time 
Data 

The best fit model is representative of a nominal^ part.
Drilling electronics under downhole conditions can fail
through several mechanisms, which can be caused by the
interaction of several variables (such as temperature, vibration,
and power cycles). The time to failure is therefore expressed as a
weighted average of several competing models. Bayesian
updating is used to select the most accurate failure model
for a specific part by using the real time mission profile for
that part. Bayesian updating provides a systematic process
for incorporating real time operational data for model
selection and updating. This section presents the Bayesian
formulation for updating the probability of an event y based on
recorded observations at time t (examples of observations
include pass/fail events and mission profile parameters such as
temperature, lateral vibration and stick slip). More details
on this formulation can be found in Zhang and Mahadevan
(2000). The symbol $M_i$ is the i-th model; $p(M_i)$^ is the
probability of the model and reflects the belief that the model
is accurate for the specific part in operation; $p(y \mid x_i, t, M_i)$ is
the probability of observing an outcome y at time t using the
model; and the vector $x_i$ is the set of parameters estimated by
the calibration procedure. The term $f(x_i \mid M_i)$ is the joint
probability density function of the parameters of the model.


^ PRESS is the sum of the squared differences between the data
and the model prediction, where the model is constructed by
excluding one data point and this is repeated over all the data
points.

^ The Euclidean norm of a vector v in an n-dimensional space is
the geometric distance from the origin to the point v.

^ A representative part that has a life equal to the average of
several parts produced using the same manufacturing process
and operating under the same conditions.

^ Note that $\sum_i p(M_i) = 1.0$.




The event y is the state of the part at time t and takes one of
two values, z = pass or z = fail.

$$p(y) = \sum_{i=1}^{m} p(M_i) \int p(y \mid x_i, t, M_i)\, f(x_i \mid M_i)\, dx_i \qquad (2)$$

The prior probability $p(G_i)$ of the parameters of the model is
given by Eq. (3).

$$p(G_i) = p(M_i)\, f(x_i \mid M_i) \qquad (3)$$


$p(G_i)$ is the prior probability of the $(M_i, x_i)$ pair. The posterior
probability after observing an outcome y = z is given
by Bayes' theorem in Eq. (4).

$$p(G_i \mid y = z) = p(M_i \mid y = z)\, f(x_i \mid M_i, y = z) = \frac{p(y = z \mid G_i)\, p(M_i)\, f(x_i \mid M_i)}{\sum_{j=1}^{m} p(M_j) \int p(y = z \mid x_j, t, M_j)\, f(x_j \mid M_j)\, dx_j} \qquad (4)$$

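Integrating over the probability distribution of $x_i$ in Eq. (4),
the posterior weight of model $M_i$ after observing
an outcome y = z is given by Eq. (5).

$$p(M_i \mid y = z) = \frac{p(M_i) \int p(y = z \mid G_i)\, f(x_i \mid M_i)\, dx_i}{\sum_{j=1}^{m} p(M_j) \int p(y = z \mid G_j)\, f(x_j \mid M_j)\, dx_j} \qquad (5)$$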
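The model-weight update of Eq. (5) can be sketched numerically as prior times marginal likelihood, normalized over the candidate models. The example below is a sketch under stated assumptions: each model is a Weibull distribution, the parameter density $f(x_i \mid M_i)$ is approximated by independent normal draws of the scale and shape, and all numerical values and function names are hypothetical.

```python
import math
import random


def reliability(t, eta, beta):
    """Weibull probability that a part is still working (passes) at time t."""
    return math.exp(-((t / eta) ** beta))


def marginal_likelihood(outcome, t, param_samples):
    """Monte Carlo estimate of the integral of p(y=z | x, t, M) f(x | M) dx."""
    total = 0.0
    for eta, beta in param_samples:
        r = reliability(t, eta, beta)
        total += r if outcome == "pass" else 1.0 - r
    return total / len(param_samples)


def update_weights(priors, marginals):
    """Posterior model weights: prior times marginal likelihood, normalized."""
    joint = [p * m for p, m in zip(priors, marginals)]
    evidence = sum(joint)  # total probability of the observed outcome
    return [j / evidence for j in joint]


random.seed(0)
# Hypothetical (mean, stdev) of scale eta [h] and shape beta for three models.
models = [((1500.0, 100.0), (1.6, 0.08)),
          ((1200.0, 100.0), (1.7, 0.07)),
          ((900.0, 100.0), (1.8, 0.05))]
samples = [[(random.gauss(*eta), random.gauss(*beta)) for _ in range(2000)]
           for eta, beta in models]
# Observation: the part passed at t = 800 drilling hours.
marginals = [marginal_likelihood("pass", 800.0, s) for s in samples]
posterior = update_weights([1 / 3, 1 / 3, 1 / 3], marginals)
print(posterior)
```

In this toy case a pass at 800 hours shifts weight toward the model with the longest characteristic life, which is the qualitative behavior expected from the update.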


The life cycle data for a typical low voltage power supply (LVPS)
modem used in drilling operations is shown in Fig. 4 for
parts that failed in the field and in Fig. 5 for suspensions (i.e., parts
that are operating in the field). The x-axis on the plots
represents the average temperature (lateral vibration, stick
slip and interaction effects are shown in Fig. A1 through Fig. A6 in
Appendix A). The y-axis represents drilling hours. Each
point on the figure is a unique serial number of the part, and
each part undergoes a different mission profile during its life. The
data shown in Fig. 4 is derived from failures of parts in
operation that have been root caused, and Fig. 5 shows data for
parts that are either currently being operated or were
retired as a precautionary measure.

Fig. 4 and Fig. 5 show field data with scatter and noise. Such
errors and noise cannot be totally eliminated and are part of
field data because of limitations of the measurement system
and human factors. The methodology developed in the paper
is used to reduce the scatter in the life prediction by
incorporating the cumulative effect of temperature, vibration
and their interaction on life consumption. The IRMLE
algorithm described in Section 4.2 is applied to the data in
Fig. 4 and Fig. 5, and the outliers (shown as red dots) are
identified by the algorithm. The data in Fig. 4 and Fig. A1
through Fig. A3 show that temperature and vibration have
a detrimental effect on life.


It is important to note that the time t used in Eq. (2) through
Eq. (5) is not the failure time; it is the time at which an
observation is made regarding the pass or fail state. The
expected time to failure is obtained as the weighted sum of the times
to failure predicted by each of the models, as shown in Eq.
(6).

$$t_{f,\mathrm{predicted}} = \sum_{i=1}^{m} p(M_i \mid y = z) \times t_{f,M_i} \qquad (6)$$

where $t_{f,\mathrm{predicted}}$ is the expected life of the part being
modeled and $t_{f,M_i}$ is the life predicted by model $M_i$, whose
probability distribution is given in Appendix A. Equation (6)
is solved using the Monte Carlo simulation technique. For
drilling tools, a probability of failure greater than 10% is
unacceptable. To estimate this probability accurately, we
use a sample size of 10,000^ in the Monte Carlo simulation.
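A minimal sketch of the weighted sum in Eq. (6), evaluated by Monte Carlo sampling, is given below. It assumes each model is a Weibull distribution with a fixed scale and shape; the (eta, beta) pairs are hypothetical, while the weights are the posterior weights reported in Table 2. The last function uses the binomial standard deviation $\sqrt{p(1-p)/N}$, which reproduces the 0.005 quoted for p = 50% and N = 10,000.

```python
import math
import random


def weibull_sample(eta, beta, rng):
    """Draw one Weibull time to failure by inverse-transform sampling."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is always defined
    return eta * (-math.log(u)) ** (1.0 / beta)


def expected_life(weights, params, n=10_000, seed=0):
    """Weighted sum of per-model mean life; each mean is estimated from n samples."""
    rng = random.Random(seed)
    total = 0.0
    for w, (eta, beta) in zip(weights, params):
        mean_tf = sum(weibull_sample(eta, beta, rng) for _ in range(n)) / n
        total += w * mean_tf
    return total


def mc_std_error(p, n):
    """Standard deviation of a probability estimated from n Monte Carlo samples."""
    return math.sqrt(p * (1.0 - p) / n)


weights = [0.29, 0.40, 0.31]  # posterior model weights from Table 2
params = [(1500.0, 1.6), (1200.0, 1.7), (900.0, 1.8)]  # hypothetical (eta, beta)
print(expected_life(weights, params))
print(mc_std_error(0.5, 10_000))  # approximately 0.005
```

The sampled estimate can be compared with the closed-form Weibull mean, eta times Gamma(1 + 1/beta), to confirm that 10,000 samples per model is adequate for this purpose.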

6.  Results 

The methodology developed in this paper is used to predict the
life of fielded electronic assemblies used in drilling and
evaluation tools and to give advance warning of impending failure
so that preventive maintenance can be scheduled.


^ The standard deviation of a probability p calculated by Monte
Carlo integration with N samples is given by $\sqrt{p(1-p)/N}$. For a target
probability of 50% the standard deviation is 0.005. Hence
10,000 samples are sufficient to estimate the probability levels
of interest in this paper.


Figure 4. Time to failure vs. temperature severity for fielded
LVPS modem serialized parts.

Figure 5. Suspension and operational severity for fielded
LVPS modem serialized parts.

Table 2 shows the parameters of the time to failure models
built from the data in Fig. 4 and Fig. 5. The best fit model is a
Weibull distribution with a characteristic life function
whose parameters are $\alpha$ and $\beta$. The models are generated
using the best fit procedure described in Section 5. The
values in parentheses are the mean and standard deviation of
the parameter estimates. The models in Table 2 are
comparable in terms of likelihood value and confidence
level in the coefficients. Model M1 shows that the interaction of
temperature and lateral vibration is a significant factor affecting the
life of the part; model M2 shows that temperature by itself is
significant; and model M3 shows that temperature plus the
interaction of temperature and stick slip are significant
factors.


Table 2. Competing Weibull models for time to failure of
a part as a function of operating stress.

Parameter                              M1             M2             M3
$p(M_i)$                               0.29           0.40           0.31
$\alpha_0$ ($\mu$, $\sigma$)           (7.5, 0.07)    (8.0, 0.1)     (8.6, 0.1)
T, $\alpha_1$ ($\mu$, $\sigma$)        0              (-10.3, 0.7)   (-7.9, 0.5)
S×L, $\alpha_2$ ($\mu$, $\sigma$)      0              0              (-43.8, 3.1)
T×L, $\alpha_3$ ($\mu$, $\sigma$)      (-39.3, 2.5)   0              0
$\beta$ ($\mu$, $\sigma$)              (1.6, 0.08)    (1.7, 0.07)    (1.8, 0.05)

The models in Table 2 represent failure time for a nominal
part representative of the population. To obtain a
part-specific prediction, the time to failure is
expressed as a weighted sum of failure times from each of
the models, using the operational history from each run of
that specific part and adjusting the relative contribution of
each model using the Bayesian formulation in Section 5.2.
An example is shown for predicting the time to failure of a
single part in operation. Table 3 shows the load history on
an LVPS modem operated for approximately 1000 drilling hours at varying
levels of temperature and vibration. The first column of
Table 3 shows the run number, which represents the mission
between the start and stop of the drilling operation; the
second column shows the average temperature for the run;
the third column shows the average lateral vibration level
for the run; and the fourth column shows the average
stick slip (torsional) vibration level. The lateral and stick slip vibrations
(reported as root mean square values in units of g, the acceleration
due to gravity) are measured by accelerometers
placed in the drilling assembly. The algorithm described in
Section 5 is applied to the operational history after each
drilling mission (referred to as a "run"). Starting with an equal
model weight of 0.33 for the three models, the life
prediction and model weights are updated after each run to
obtain a more accurate estimate of remaining life
(using Eq. (3) through Eq. (6)). The final values of the model
weights prior to the eighteenth run are shown in the second row
of Table 2 for each of the three candidate models.

The life expectancy predicted by Eq. (6) (shown in Table 2),
the actual hours accumulated on the part after each
drilling run, and the operating environment are shown in Fig.
6 and Table 3. Figure 6 shows the true remaining useful life
(RUL) and 95 percent confidence bounds on predicted life.
It can be seen that the true RUL lies within the
predicted 95% confidence interval. This interval represents the
statistical variation in part life for a population of identical
parts subjected to the same load history. The variation is caused
by defects in manufacturing, limitations of the measurement
system and human factors that are unknown or cannot be
modeled. The purple diamonds represent the actual RUL of
the part. Fig. 6 shows that during the early part of the part life
cycle the life expectancy is high, but with usage and
application of operating loads, the accumulated hours begin
falling within the range of variation of expected life. At that
point, the component is retired to prevent downhole tool
failure. The part failed during the nineteenth drilling run. In
retrospect, the model accurately predicted impending failure
when it showed that the part was at high risk (>75% risk of
failure) from the seventeenth run and should have been
retired at that time.
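The retirement rule described above can be sketched as a simple threshold check on the per-run risk. The risk values below are copied from the Risk column of Table 3; the function name and the idea of expressing the rule as a single threshold are illustrative assumptions.

```python
# Risk of failure after each run, copied from the Risk column of Table 3.
risk_by_run = [0.00] * 14 + [0.03, 0.25, 0.85, 0.90, 0.87]


def first_run_over(risks, threshold):
    """Return the 1-based run number where risk first exceeds threshold, else None."""
    for run_no, risk in enumerate(risks, start=1):
        if risk > threshold:
            return run_no
    return None


print(first_run_over(risk_by_run, 0.75))  # 17
print(first_run_over(risk_by_run, 0.10))  # 16
```

With a 75% threshold the part is flagged at the seventeenth run, matching the text; a 10% threshold, the acceptability limit quoted for drilling tools, would flag it one run earlier.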


Figure 6. Predicted life vs. actual drilling hours after each
run for LVPS modem.


Fig. 6 shows that the expected life of a part can increase or
decrease with each run and is not a constant
(because expected life is a function of usage). Table 3
illustrates the concept: the average value of
operational temperature and vibration over all the previous
runs is given in columns two through four. The first run
is the least severe and has the highest life expectancy. In
subsequent runs, the life expectancy reduces as the severity
of operation increases, as shown by the values of
temperature, lateral and stick slip vibrations. The trend
continues until the ninth run, after which the operational
severity starts reducing, leading to higher life expectancy
until the thirteenth run. In summary, the life expectancy can
vary through the operation depending on the severity of the
operating environment.

Table 3. Average operating environment and risk of failure
after each drilling mission (run) during the life of a part.

Run   Average       Average    Average     DrillHrs   Risk
No.   Temperature   Lateral    StickSlip   [h]
      [C]           [g_RMS]    [g_RMS]
1     57.6          1.6        0.2         55.3       0.00
2     63.8          1.5        0.1         80.8       0.00
3     57.6          1.3        0.3         149.2      0.00
4     71.9          1.1        0.2         215.4      0.00
5     74.9          1.1        0.2         231.0      0.00
6     72.0          1.1        0.2         266.1      0.00
7     70.1          1.1        0.2         295.1      0.00
8     77.3          1.0        0.3         361.4      0.00
9     81.8          0.9        0.3         412.6      0.00
10    78.9          0.9        0.3         472.6      0.00
11    76.5          0.8        0.3         530.6      0.00
12    73.0          0.9        0.2         633.8      0.00
13    71.2          0.9        0.2         686.4      0.00
14    71.7          0.9        0.3         761.5      0.00
15    73.3          0.9        0.3         788.5      0.03
16    75.5          0.9        0.2         844.9      0.25
17    79.6          0.9        0.2         948.0      0.85
18    78.6          0.9        0.2         981.0      0.90
19    78.4          0.9        0.2         986.0      0.87
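Table 3 reports operating conditions averaged over the runs completed so far. A minimal sketch of such a running average follows. It assumes the average is weighted by the drilling hours of each run, which the paper does not state, and the per-run values in the example are hypothetical.

```python
def running_averages(values, hours):
    """Hour-weighted average of a per-run quantity over runs 1..k, for each k."""
    averages = []
    weighted_sum = 0.0
    total_hours = 0.0
    for value, h in zip(values, hours):
        weighted_sum += value * h
        total_hours += h
        averages.append(weighted_sum / total_hours)
    return averages


# Hypothetical per-run mean temperature [C] and run duration [h].
run_temperature = [50.0, 70.0, 90.0]
run_hours = [10.0, 10.0, 20.0]
print(running_averages(run_temperature, run_hours))  # [50.0, 60.0, 75.0]
```

Because each new run is blended with everything that came before it, a single severe run moves the average less and less as hours accumulate, which is consistent with the slowly varying columns in Table 3.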

7.  Conclusions 

The paper presents a generic methodology to predict the life
of electronic components used in drilling and evaluation
tools. Statistical modeling techniques are used to derive best
fit mathematical equations for the durability of parts from field
data. The method is applied to predict the life of
printed circuit board assemblies (PCBAs) and to retire high risk
components. The key challenges associated with developing
durability models for PCBAs in the drilling environment are:

(a) Life of parts is impacted by several factors, not all
of which can be measured accurately because of
limitations of measurement systems and human
factors.

(b) Field data may have noise and errors that may
affect the quality of the predictive model.

(c) Statistical models do not incorporate the physics of
degradation and may not be applicable to all
failure mechanisms.

The methodology addresses the aforementioned challenges
for the first time in the lifing of parts
operating in downhole drilling environments. The key
features of the analysis methodology include:


(a)  Algorithm  to  determine  life  from  cumulative 
damage  over  time  and  the  best-fit  mathematical 
model  using  a  combination  of  statistical 
distribution  and  characteristic  life  function. 

(b)  Clustering  mechanism  to  group  parts  life  cycle  data 
by  upgrades,  repair,  failures  and  suspensions. 

(c)  A  pattern  search  and  outlier  detection  algorithm  to 
identify  data  from  a  physical  degradation  trend. 

(d)  Iteratively  reweighted  maximum  likelihood 
estimation  method  to  determine  optimal  weights  of 
data  points. 

(e)  A  Bayesian  model  selection  technique  to 
incorporate  part  specific  operational  history  to 
obtain  improved  accuracy  in  life  prediction. 

Future  work  will  focus  on  improving  model  predictions  by 
using  additional  environment  variables  as  well  as  integrating 
data  from  design  and  qualification  tests. 

Nomenclature 

ASS  =  AutoTrak  steering  system 

BCPM  =  Bi-directional  communication  and  power  module 

BHA  =  Bottomhole  assembly 

HALT  =  Highly  accelerated  life  test 

HAST  =  Highly  accelerated  stress  test 

IRMLE = Iteratively reweighted maximum likelihood estimation

LVPS  =  Low  voltage  power  supply 

LWD  =  Logging  while  drilling 

MaPS  =  Maintenance  and  performance  system 

MLE  =  Maximum  likelihood  estimation 

MWD  =  Measurement  while  drilling 

PCBA  =  Printed  circuit  board  assembly 

PHM  =  Prognostics  and  health  management 

PoF = Physics of failure

RPM = Revolutions per minute

F = Failure

L  =  Lateral  vibration 

Mi  =  model  identifier 

N  =  Symbol  used  to  represent  negative  decision,  generally 
“no”  or  “0” 

S  =  Symbol  used  to  represent  stick  slip  or  suspensions 
T  =  Temperature 

X  =  Vector  of  parameters  like  temperature  and  vibrations 
Y  =  Symbol  used  to  represent  affirmative  decision,  generally 
“yes”  or  “1” 

f=  Probability  density  function 
m  =  Number  of  models 
n  =  Number  of  records 
p  =  Probability 

$p(a \mid b)$ = Conditional probability of occurrence of event a
provided b is true

revid  =  Revision  identifier 

$t_f$ = Time to failure (drilling hours)

$W_i$ = Weight of data point
$x_{ave}$ = Average value of parameter x
$x_{stdev}$ = Standard deviation of parameter x
$\alpha$ = Calibration parameters of reliability model
L = Likelihood

$\eta$ = Characteristic life or scale factor of a probability
distribution

$\beta$ = Shape factor of a probability distribution
$\sigma$ = Standard deviation
$\lambda$ = Hazard function

{CF} = Set of life data for confirmed failure

{O} = Set of outliers

{S} = Set of life data for suspension

{UF} = Set of life data for unconfirmed failure

Load, Stress and Severity are used interchangeably to
describe the impact of an operational environment
(mechanical and thermal) on the durability of parts.

Nominal  part  is  a  representative  part  that  has  a  life  equal  to 
the  average  of  several  parts  produced  using  the  same 
manufacturing  process  and  operating  under  the  same 
condition. 

Run  refers  to  a  drilling  mission  that  can  last  for  several 
hours. 

Suspensions are used in reliability modeling to represent
hours accumulated on parts that are in operation or removed
from service for reasons other than failure.

Appendix A

A.  General  Log-Linear  Model 

The relation between the life characteristic and the stress
variables is represented using one of three models:
generalized log-linear (GLL), proportional hazard (PH) or
cumulative damage (CD). The GLL model represents life
using Eq. (A-1)

r\{x)  =  +Iif=  1X7=1  (A-1) 


C.  Cumulative  Damage  Model 

The cumulative damage model is designed to incorporate
the effect of varying stress on the life of components. The
model takes into account the impact of damage accumulated
at each stress level on the reliability of the part. Damage
accumulation can take place at different rates for different
stress levels and can be determined using the linear damage
sum (Miner's rule), the inverse power law or cycle counting
techniques such as rain flow counting. The cumulative damage
model used in the paper is established from Miner's rule,
which is based on the hypothesis that if there are n different
stress levels and the time to failure at the i-th stress level is $T_{f,i}$,
then the damage fraction, p, is given by Eq. (A-7):


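where x = {T, L, S}. For a Weibull distribution, the
probability density function is shown in Eq. (A-2), where $\beta$
is the shape parameter, $\eta$ is the scale parameter and the $\alpha$'s are
unknown parameters calculated from field data using the
maximum likelihood estimation technique.

$$f(t, x) = \frac{\beta}{\eta(x)} \left(\frac{t}{\eta(x)}\right)^{\beta - 1} \exp\left[-\left(\frac{t}{\eta(x)}\right)^{\beta}\right] \qquad (A-2)$$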
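The Weibull density and its exponential special case can be checked numerically. The sketch below assumes the standard two-parameter Weibull form with a fixed scale eta (the stress dependence of eta on x is left out), and the function names are illustrative only.

```python
import math


def weibull_pdf(t, eta, beta):
    """Two-parameter Weibull probability density at time t."""
    return (beta / eta) * (t / eta) ** (beta - 1.0) * math.exp(-((t / eta) ** beta))


def weibull_reliability(t, eta, beta):
    """Probability of surviving beyond time t."""
    return math.exp(-((t / eta) ** beta))


def weibull_hazard(t, eta, beta):
    """Hazard rate, computed as density divided by reliability."""
    return weibull_pdf(t, eta, beta) / weibull_reliability(t, eta, beta)


def exponential_pdf(t, eta):
    """Exponential density with mean eta."""
    return math.exp(-t / eta) / eta


# With beta = 1 the Weibull density reduces to the exponential density.
print(weibull_pdf(500.0, 1000.0, 1.0), exponential_pdf(500.0, 1000.0))
# With beta > 1 the hazard rate increases with time (wear-out behavior).
print(weibull_hazard(200.0, 1000.0, 1.7) < weibull_hazard(800.0, 1000.0, 1.7))
```

The shape estimates reported in Table 2 are all above 1, so under this form the fitted models imply a hazard rate that grows with accumulated drilling hours.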


The probability density function (PDF) for an exponential
distribution can be obtained by setting $\beta = 1$ in Eq. (A-2).
For the lognormal distribution, the probability density function
for a GLL stress function is shown in Eq. (A-3):


(A-3) 


B.  Proportional  Hazard  Model 

For a proportional hazard model, the hazard rate of a
component is affected by hours in operation and stress
variables. The instantaneous hazard rate of a part is given by
the equation:

$$\lambda(t, x) = \frac{f(t, x)}{R(t, x)} = \lambda_0(t)\, \eta(x, \alpha) \qquad (A-4)$$

where f is the probability density function and R is the
reliability function. The instantaneous hazard rate $\lambda_0$ is a
function of time only and the stress function $\eta$ is a function of
operating stresses such as temperature or vibration. The list of
unknown model parameters $\alpha$ is obtained by calibrating the
model to test data using maximum likelihood estimation
(MLE). The stress function $\eta$ is given by Eq. (A-5):


r](x)  =  (A-5) 

Substituting  Eq.  (A-5)  in  Eq.  (A-2),  the  hazard  function  can 
be  written  for  a  Weibull  distribution  using  Eq.  (A-6): 


$$p = \sum_{i=1}^{n} \frac{t_i}{T_{f,i}} \qquad (A-7)$$

where $t_i$ is the number of cycles accumulated at the i-th stress level,
$T_{f,i}$ is the time to failure at that level, and
failure occurs when the damage fraction equals unity. The
probability distribution functions for the Weibull and lognormal
distributions are obtained by substituting Eq. (A-7) in Eqs.
(A-2) and (A-3), respectively. Given the stress variables x =
[T, L, S, RPM, L×T, S×T, L×S, S×RPM], the PDF for a
Weibull distribution is given by:

$$I(t, x) = \int_0^t \frac{dt}{\eta(x)}$$

$$f(t, x) = \beta\, s(t, x)\, \left(I(t, x)\right)^{\beta - 1} e^{-\left(I(t, x)\right)^{\beta}} \qquad (A-8)$$
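Miner's rule, on which the cumulative damage model is based, can be illustrated with a short calculation. The sketch below sums the fraction of life consumed at each stress level; the exposure hours and per-level lives are hypothetical values chosen only to show the arithmetic.

```python
def damage_fraction(hours_at_level, life_at_level):
    """Miner's rule: sum of time spent at each stress level over life at that level."""
    return sum(t / tf for t, tf in zip(hours_at_level, life_at_level))


# Hypothetical exposure: hours accumulated and time to failure at three stress levels.
hours = [200.0, 300.0, 100.0]
life = [2000.0, 1000.0, 500.0]
p = damage_fraction(hours, life)
print(round(p, 3), "failed" if p >= 1.0 else "surviving")  # 0.6 surviving
```

Here 60% of the life has been consumed, so under the linear damage hypothesis the part is predicted to fail once the remaining 40% is used up, at whatever rate the later stress levels dictate.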


D.  Characteristic  Life  Function 

The  life  characteristic  function  describes  a  general  relation 
between  failure  time  and  stress  levels.  The  life  characteristic 
can  be  any  time-to-failure  measure  such  as  the  mean, 
median  or  hazard  rate  that  represents  a  bulk  property  of  a 
probability  distribution.  Ideally,  the  function  incorporates 
the  governing  equations  that  represent  the  physical 
phenomenon  of  degradation  of  the  material  under 
application of load. Typical electronic circuit boards used in
drilling and evaluation tools are complex, and the governing
equations representing degradation and failure mechanisms
are difficult to model; hence, the paper evaluates several
empirical functions of the stress variables and selects the
one that best fits the field data.

E. Maximum Likelihood Estimation and Outlier Detection

Maximum likelihood estimation (MLE) obtains the
most likely values of the parameters that best describe the life cycle
data. Typically, the life cycle data of a part contain two
populations: (a) hours to failure for samples that failed in
an experiment or in the field and (b) hours in operation for parts
that are either currently being operated or were
retired as a precautionary measure but were fully functional
at that time.
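How failures and suspensions enter the likelihood can be sketched for a Weibull model as follows. This is a simplified, unweighted illustration: the data-point weights of the IRMLE algorithm and the interval-censored term are omitted, the scale is a fixed number rather than a function of stress, and the drilling hours are hypothetical.

```python
import math


def weibull_log_likelihood(failures, suspensions, eta, beta):
    """Log-likelihood for exact failure times plus right-censored suspensions."""
    ll = 0.0
    for t in failures:
        # ln f(t): density contribution of an observed failure
        ll += math.log(beta / eta) + (beta - 1.0) * math.log(t / eta) - (t / eta) ** beta
    for t in suspensions:
        # ln(1 - F(t)): survival contribution of a part still working at t
        ll += -((t / eta) ** beta)
    return ll


# Hypothetical drilling hours.
failures = [400.0, 650.0, 900.0]
suspensions = [300.0, 750.0]

# For beta = 1 the maximizing scale is known in closed form:
# total accumulated hours divided by the number of failures.
eta_hat = (sum(failures) + sum(suspensions)) / len(failures)
print(eta_hat)  # 1000.0
print(weibull_log_likelihood(failures, suspensions, eta_hat, 1.0))
```

Suspensions add hours to the numerator without adding failures to the denominator, which is why ignoring them would bias the estimated life downward.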


IniL)  =  X  N[  X  x 

Nf  X  ln{l  -  F(rsi,rj,/?))  -  x  Nf  x 

In  {(l  -  F(7^^  rj,  p))  -  (l  -  F(7^«,  rj.  p))}  (A-9) 


Where  the  initial  weight  of  each  data  point  is  given  by 


w;  = 


(A-IO) 


Fg  is  the  number  of  samples  for  which  the  exact  times-to- 
failure  is  known,  Nf  is  the  number  samples  for  which  the 
exact  time-to-failure  is  Tf,  f  is  the  probability  density 
function  (pdf)  for  time  to  failure,  rj  is  the  scale  factor  and  P 
shape  factor  of  the  pdf,  is  the  number  samples  for  which 
the  right  censoring  time  is  Tsi,  N-  is  the  number  samples  for 
which  the  left  censoring  time  is  and  right  censoring  time 
is TP. The weight W* of each data subgroup is
determined by the IRMLE algorithm. The outliers identified
by the algorithm are shown in Fig. A1 through Fig. A6, and the
comparison of estimated life versus actual drilling hours to
failure is shown in Fig. A7.

Figure A1. Time to failure vs. lateral vibration severity for
fielded LVPS modem serialized parts.

Figure A2. Time to failure vs. stick slip vibration severity
for fielded LVPS modem serialized parts.

Figure A3. Impact of interaction of temperature and
vibration on failure of LVPS modem serialized parts.

Figure A4. Suspension time vs. lateral vibration severity for
fielded LVPS modem serialized parts.

Figure A5. Suspension time vs. stick slip vibration severity
for fielded LVPS modem serialized parts.

Figure A6. Suspension time vs. interaction effect for fielded
LVPS modem serialized parts.

Figure A7. Comparison of actual life vs. predicted mean
life for parts that failed in the field.